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SUMMARY* A method for feature extraction from protein sequences has been developed which is 
based on an artificial neural filter system. Amino acid sequences are analyzed with regard to physico- 
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hydrophobic core region starting at position -6. Further striking features and dominant positions can be 
found for all three types of cleavage sites. © Academic Press, Inc.
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as precursor proteins in the cytoplasm of the cell. Proteins targeted to the organelles are processed inside of 
them by special proteases which cleave off N-terminal targeting sequences (1-6). The cleavage-site region
of precursor proteins consists of amino acid sequences showing little or no homology (3, 7, 8).
Nevertheless, statistical investigations revealed that the cleavage-site region is locally encoded and thus a
good candidate for the development of prediction systems. A clearly defined cleavage-site region exists for
E.coli precursor proteins (7, 9) as well as for secretory proteins translocated through the endoplasmic
reticulum membrane of eukaryotes, and for many organelle types (10-12). Since many new sequences will
be compiled in the future by sequencing whole chromosomes from yeast (13) and human, the classification
of targeting peptides and their cleavage-sites in nuclear-encoded organelle proteins, among other
functionally important sequences, will be helpful for understanding genome organization. Until now, alignment
procedures and statistical approaches have led to the discovery of several rules describing cleavage-site regions,
e.g. the well-known "-3, -1 rule" (7, 9) for eubacterial and eukaryotic secretory targeting signals.
Using these rules for prediction leads to accuracies of less than 100%, which is still not sufficient for an
accurate analysis of chromosome organization. Reliable prediction systems for the detection of cleavage-sites
exist only for secretory signals of eubacterial and eukaryotic precursors (1, 14, 15, 16). It has been
reported that artificial neural networks are, in principle, able to recognize complicated sequence patterns
automatically, i.e. without biased instructions from the investigator (15-18). For the identification of
locally encoded sequence patterns, improved results for sequence classification and prediction can already be
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obtained with a Perceptron-type neural network although it still has limitations due to its architecture (19). 
In this work we applied a Perceptron-type neural network to the identification of cleavage-site features,
which allows a property-based analysis of protein sequence data combined with a neural net feature
extraction (Figure 1). Eubacterial signal peptides (targeting signals for export and secretion), chloroplast
transit peptides, and mitochondrial transit peptides were analyzed. In contrast to most earlier approaches to
this problem, the cleavage-site sequences are described in terms of four physico-chemical amino acid
properties which have been identified as useful for cleavage-site analysis (15, 16, 20): hydrophobicity
(21), hydrophilicity (22), polarity (23), and side-chain volume (24) (Table 1). A complete description of
amino acid sequences in terms of physico-chemical properties will ultimately be necessary to understand
the function and structure of proteins, but unfortunately such a description is not possible today. Thus, we
focussed on only four amino acid properties which appear to be important. This data representation is
thought to reveal characteristic cleavage-site features which cannot be found by looking at the sequence
letter code.

An evolutionary computing algorithm, the simple Evolution Strategy (25), was used for network training
instead of the commonly used generalized delta rule (26). Evolutionary algorithms are efficient
optimization techniques which have been shown to be very useful and to produce reliable results when applied
to artificial neural networks (15, 16). A dominant feature of a neural network system is its ability to
determine which positions and which residues are important for a certain protein structure or function. Most
common for the analysis of sequence data is a three-layer feedforward network architecture, which has been
shown to be able to approximate any continuous input-output relation (27, 28), e.g. the identification of
cleavage-sites in precursor sequences. Although these multilayer networks appear to be well suited for the
development of prediction systems, they cannot show explicitly what the important sequence
features are. On the one hand this is due to the non-symbolic technique itself; on the other hand many
investigations using neural networks are based on a sequence representation in terms of binary numbers
representing characters rather than physico-chemical property values. To overcome this disadvantage we
used a two-layer network architecture for the analysis of cleavage-site features, which allows a property-based
analysis of protein sequence data combined with a neural network feature extraction. The
optimized networks obtained have been analyzed using Hinton-diagrams, which make it possible to interpret a
network's weight values $w_{ij}$ (Figure 1).

METHODS

Data selection and preparation: 24 E.coli periplasmic protein precursors, 27 chloroplast precursor
sequences from spinach, and 39 human mitochondrial precursor proteins with experimentally confirmed
cleavage-sites were selected from the SwissProt database, Rel. 20. The sequences were randomly divided
into training sets and test sets following a ratio of 7:3 for every type of precursor. This resulted in 17 E.coli,
19 chloroplast, and 27 mitochondrial training sequences. These data were used for training of the
Perceptron system. The remaining test set sequences were used for an evaluation of the network's
generalization ability by measuring the prediction accuracy of the optimized networks when applied to the
corresponding test set. We are aware that e.g. cross-validation tests are more precise than an
evaluation with regard to a single test set. Our aim was not to establish a new prediction method. Rather,
we want to present a useful technique for protein data analysis using physico-chemical amino acid
properties. A complete list of the data sets including names and sequences of the precursors is available
from the authors on request. For training of the networks on the detection of cleavage-site features the
training data were restricted to sequence strings covering 10 residues of the targeting sequence and 2
residues of the N-terminal end of the mature protein (positions -10 to +2). These 12-residue windows
served as positive examples for network training. For every positive example 3 negative examples were
selected randomly from the whole precursor sequence, e.g. the whole training data for the E.coli
cleavage-site consisted of (17 positive + 51 negative = 68) sequence windows.
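
The data preparation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the fixed random seed, the rounding rule used for the 7:3 split, and the decision to exclude only the true window from the negative candidates are all hypothetical, and the example sequence in the usage note is an arbitrary illustrative string.

```python
import random

def split_train_test(items, train_fraction=0.7, rng=None):
    """Randomly split items at roughly 7:3 (rounding rule is an assumption)."""
    rng = rng or random.Random(0)
    shuffled = list(items)
    rng.shuffle(shuffled)
    k = round(train_fraction * len(shuffled))
    return shuffled[:k], shuffled[k:]

def cleavage_window(seq, site, upstream=10, downstream=2):
    """Return the window covering positions -upstream .. +downstream.

    `site` is the 0-based index of the first mature-protein residue (+1).
    Returns None if the window would run off either end of the sequence.
    """
    start, end = site - upstream, site + downstream
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]

def build_examples(seq, site, n_negative=3, upstream=10, downstream=2, rng=None):
    """One positive window at the true site plus randomly placed negatives."""
    rng = rng or random.Random(0)
    positive = cleavage_window(seq, site, upstream, downstream)
    if positive is None:
        raise ValueError("sequence too short around the cleavage site")
    width = upstream + downstream
    true_start = site - upstream
    # Assumption: any full-length window other than the true one may be negative.
    candidates = [s for s in range(len(seq) - width + 1) if s != true_start]
    starts = rng.sample(candidates, n_negative)
    negatives = [seq[s:s + width] for s in starts]
    return [(positive, 1.0)] + [(w, 0.0) for w in negatives]
```

With this rounding rule, splitting 24, 27, and 39 sequences gives 17, 19, and 27 training sequences, matching the set sizes reported above; each training sequence then contributes one positive and three negative 12-residue windows.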

Network architecture and training technique: A model of our Perceptron-type neural network is
given in Figure 1. The amino acids of a given sequence window are numerically described by four
physico-chemical side chain properties: hydrophobicity (21), hydrophilicity (22), polarity (23), and volume (24),
leading to a (12 x 4) property matrix. The scales were normalized to give comparable values between 0.0
and 1.0 (Table 1). Every value $x_{ij}$ of the input property matrix is connected by a weight factor $w_{ij}$. The
single output unit calculates the network's output value (0.0 < y < 1.0) for a given sequence window using
a sigmoid function F(x) = 1 / (1 + exp(-x)) as transfer function:

$$y = F\left(\sum_{i,j} x_{ij}\, w_{ij}\right).$$
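
As a minimal sketch of this computation, the following Python functions encode a residue window as a property matrix and evaluate the single output unit. The linear min-max rescaling is an assumption (the text only states that the scales were normalized to lie between 0.0 and 1.0), the function names are hypothetical, and the clamp inside `sigmoid` is a numerical safeguard added here rather than part of the described method.

```python
import math

def sigmoid(x):
    """F(x) = 1 / (1 + exp(-x)); the argument is clamped to avoid overflow."""
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def min_max_normalize(scale):
    """Rescale a {residue: value} property scale linearly to [0, 1] (assumed)."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (v - lo) / (hi - lo) for aa, v in scale.items()}

def encode(window, scales):
    """Turn a residue window into a (len(window) x len(scales)) matrix x_ij."""
    return [[scale[aa] for scale in scales] for aa in window]

def network_output(x, w):
    """y = F(sum_ij x_ij * w_ij) for equally shaped matrices x and w."""
    total = sum(xij * wij
                for x_row, w_row in zip(x, w)
                for xij, wij in zip(x_row, w_row))
    return sigmoid(total)
```

For a 12-residue window and four normalized scales, `encode` yields the (12 x 4) matrix and `w` holds the 48 weights; with all weights at zero the output is exactly 0.5.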

In contrast to the classical Perceptron system (26), the output unit of our network employs a sigmoidal
transfer function. Therefore, the separation of positive and negative examples of a training set is not
restricted to linear separation (19). The task was to find the correct weight values $w_{ij}$ which allow an
optimal separation between positive and negative examples. For this, a (1,100)-evolution strategy with
adaptive stepsize control was used (25). This optimization method employs a systematic top-down search in
the feature space, imitating the natural process of repeated mutation and selection. Here, the weight values
and the stepsize in a learning cycle were the parameters for this mutation-selection procedure (generate-and-test
cycle), starting with random values. This evolutionary algorithm has already been successfully applied
to the optimization of neural networks for pattern recognition in protein sequences (15, 16), where it is
described. The best weight values of a learning cycle are selected following an external quality function. We
used the minimization of the square error $\Delta^2$ for supervision of the learning process:



$$\Delta^2 = \sum_{p=1}^{n} (t_p - y_p)^2.$$
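
The training loop can be illustrated with a short sketch of a (1,100) evolution strategy that minimizes this square error. It is a sketch under assumptions, not the procedure of reference (25): the log-normal stepsize mutation, the initial stepsize of 0.5, the weight initialization range, and all function names are choices made here for illustration. Each example is a pair of a flattened property vector and its target value.

```python
import math
import random

def squared_error(weights, examples):
    """Delta^2 = sum over examples of (t - y)^2, with y = F(sum_i x_i * w_i)."""
    err = 0.0
    for x, t in examples:
        s = sum(xi * wi for xi, wi in zip(x, weights))
        s = max(-60.0, min(60.0, s))          # numerical safeguard only
        y = 1.0 / (1.0 + math.exp(-s))
        err += (t - y) ** 2
    return err

def evolve(examples, n_weights, generations=200, offspring=100, seed=0):
    """(1,lambda) strategy: one parent, `offspring` mutants, best mutant survives."""
    rng = random.Random(seed)
    parent = [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
    sigma = 0.5                                # assumed initial stepsize
    tau = 1.0 / math.sqrt(n_weights)           # assumed stepsize learning rate
    history = []
    for _ in range(generations):
        best = None
        for _ in range(offspring):
            # Mutate the stepsize first, then the weights with the new stepsize.
            s = sigma * math.exp(tau * rng.gauss(0.0, 1.0))
            child = [w + rng.gauss(0.0, s) for w in parent]
            e = squared_error(child, examples)
            if best is None or e < best[0]:
                best = (e, child, s)
        # Comma selection: the parent is discarded, the best offspring replaces it.
        err, parent, sigma = best
        history.append(err)
    return parent, history
```

Because the parent is always replaced, the error is not guaranteed to decrease in every generation; the 200-generation limit in the default argument mirrors the maximum number of learning cycles stated in the text.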



[Figure 1 schematic: amino acid characters -> properties -> output unit -> output value, with y = F(sum of x_ij w_ij)]
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Figure 1. Model of the Perceptron architecture. Amino acid sequences are represented in terms of four 
physico-chemical properties. The single output unit uses a sigmoidal function F(x) as transfer function.
The output value y for a sequence window of 12 residues is calculated by the given formula. For clarity,
only the connections for the first and the last residue of the input window are drawn. In total, 48 network
connection weights $w_{ij}$ had to be optimized.
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Tabic 1. The normalized property scales used for the sequence description. References, sec text. 



Here, n is the number of sequence windows in the training set, $t_p$ is the desired output value (1.0 for a
positive example, 0.0 for a negative example), and $y_p$ is the actual network output for the training sequence
p. Starting with random weight values, this quality function served as a heuristic for a systematic search for
optimal values. To reduce the effect of overlearning, i.e. a specialization of the network on the training set
sequences rather than the extraction of a general feature, a second simple statistical quality function Q (29)
was used to determine the termination time of the learning process. Q calculates the prediction accuracy of
the network:



$$Q = \frac{P + N}{\mathit{tot}}$$



Here, P is the number of correct positive predictions, N is the number of correct negative predictions, and tot
is the total number of investigated residues. If Q reaches the value 1.0 (no prediction error) the learning
process will terminate. A maximum of 200 learning cycles (generations) was allowed. A neural filter network for
reliable recognition of cleavage-site features must show a high Q value (max. 1.0) and a low absolute square
error of its output value. Overprediction of cleavage-sites (false positive prediction) will occur if the network
has "learned" a too general feature; underprediction (false negative prediction) will occur if a special training
set feature is used for prediction. In both cases the neural filter system does not show a good generalization
ability.
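
The quality function Q can be written as a short Python function. The decision threshold of 0.5 on the network output is an assumption, since the text does not state how an output value is turned into a positive or negative call; the function name is likewise hypothetical.

```python
def q_accuracy(outputs, targets, threshold=0.5):
    """Q = (P + N) / tot for network outputs and 1.0/0.0 target values."""
    # P: positive examples called positive; N: negative examples called negative.
    p = sum(1 for y, t in zip(outputs, targets) if t == 1.0 and y >= threshold)
    n = sum(1 for y, t in zip(outputs, targets) if t == 0.0 and y < threshold)
    return (p + n) / len(targets)
```

In a training loop, learning would stop as soon as `q_accuracy` returns 1.0 or the generation limit is reached.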



RESULTS 

Network training: For the E. coli targeting peptide cleavage-site a feature could be found leading to
100% correct classification of training and test set examples (Q) with a final output error of 0.74 (Table 2).
The features found for chloroplast and mitochondrial precursor cleavage-sites allowed 99% and 97%
correct classification of the training data, respectively. The Q values for the independent test sets are lower
(72% and 79%), and the corresponding output errors are higher than for the E. coli sequences (1.17 and 4.89),
indicating that no generalizing feature could be found.
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0.1699 
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0.6756 
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0.5625 
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0.3435 
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0. 1938 
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0.9558 


0.3041

Amino acid   Hydrophobicity   Hydrophilicity   Polarity   Volume
Cys          0.8938           0.3750           0.2846     0.2836
Gln          0.5125           0.5625           0.6788     0.4997
Glu          0.2563           1.0              0.9596     0.4669
Gly          0.8313           0.6875           0.0        0.0
His          0.5813           0.4531           0.9923     0.5551
Ile          0.9625           0.2500           0.2500     0.6357
Met          […] 0.2750, 0.6130 […]
Phe          […] 0.1406, 0.6731, 0.7740
Pro          0.7563           0.5313           0.3038     0.3733
Ser          0.8063           0.5781           0.3212     0.1723
Thr          0.3438           0.7188           0.3192     0.3339
Trp          0.8875           0.0              0.4038     1.0
[…]          0.7250           0.1719           0.3096     0.7961
[…]          […] 0.4764

Table 2. The filters for the three investigated types of cleavage-sites were evaluated […]
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A critical step in the computer experiments was the choice of the size of the input laver. The number of 1 "> 
units was chosen because it appears to be an ideal window size for eubacterial cleavage-sites (1). In fact, a
prediction accuracy (Q) of 100% could be achieved for the E.coli sequences (Table 2). Thus, we conclude
that the signal for cleavage by periplasmic signal peptidase is locally encoded and completely described by a
short stretch of amino acids. Here, the sequence representation in terms of the four physico-chemical
properties (Table 1) was sufficient for the extraction of a generalizing cleavage-site feature. Chloroplast
signals appear to be more locally encoded (Q=99%) than mitochondrial cleavage-sites (Q=97%), although
this statement is not very well supported. It is striking that in these two cases the data preparation was not
ideal. Several empirical parameters of the neural network system, e.g. the transfer function or the size of
the input matrix, should be optimized for further analysis of the organelle sequences, too. Recently, a
systematic approach for an optimization of neural network parameters which is also based on evolutionary
computing algorithms has been proposed (30).

Interpretation of the Hinton-diagrams: An important aspect of our Perceptron-type network
architecture is the use of an amino acid property matrix as input instead of the common 20-bit binary
number coding for a residue. The sequence representation as a property matrix of real numbers allows a
quasi-symbolic interpretation of the network weights in a comprehensible and biochemically meaningful
way. A simple way to do this is a graphical representation similar to a Hinton diagram (Figure 2). The
weight values obtained were normalized and split into four groups (see the legend of Figure 2). We interpret
extreme weight values - indicated by black or white squares - as showing the most important sequence
positions. Nevertheless, a whole diagram must be regarded as the respective cleavage-site feature for E.coli
(Figure 2A), chloroplasts (Figure 2B) and mitochondria (Figure 2C). It must be stressed that only the
properties hydrophobicity and volume are not correlated, while all other properties are substantially
correlated. For this reason, special care must be taken when interpreting the Hinton diagrams.
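
The grouping of weights used for the diagrams can be sketched as follows. The four class boundaries are those given in the legend of Figure 2; the normalization by the largest absolute weight is an assumption, since the text only says that the weights were normalized, and the function names are hypothetical.

```python
def normalize_weights(w):
    """Scale a weight matrix into [-1, 1] by its largest absolute value (assumed)."""
    peak = max(abs(v) for row in w for v in row)
    if peak == 0.0:
        return [list(row) for row in w]
    return [[v / peak for v in row] for row in w]

def weight_class(v):
    """Map a normalized weight to one of the four legend classes (0..3)."""
    if v < -0.5:
        return 0        # -1.0 <= weight < -0.5
    if v < 0.0:
        return 1        # -0.5 <= weight < 0.0
    if v < 0.5:
        return 2        #  0.0 <= weight < 0.5
    return 3            #  0.5 <= weight <= 1.0

def hinton_classes(w):
    """Class index for every weight of a (positions x properties) matrix."""
    return [[weight_class(v) for v in row] for row in normalize_weights(w)]
```

Classes 0 and 3 correspond to the extreme weights that the text treats as marking the most important sequence positions.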

Description of E.coli signal peptide cleavage-site patterns: E.coli signal peptides contain a
hydrophobic part beginning at position -6 (Figure 2A). Here, small residues appear to be important, too.
An additional preference for hydrophobic residues can be found at the positions -1 and -3, where position -3
appears to be less hydrophobic than -1. The positions -3 and -1 are also dominant with regard to the
properties polarity and volume; small apolar amino acids seem to play an essential role for the cleavage-site
signal. In position -2 there is a requirement for large non-hydrophilic side chains. We regard position -2 as
[…] hydrophobic […] the positions -3 to -1. Previous investigations already identified this "hydrophobic core"
and the "-3, -1 rule" (7, 9, […], 31). By help of […] it is possible now […] polar residues at […] appear to
contribute to the cleavage-site signal, too.

Figure 2. […] The (12 x 4) […] extracted by the neural filters for the three types of precursors (A, B, C):
A, E.coli signal peptides; B, chloroplast transit peptides; C, mitochondrial prepeptides. Properties:
hydrophobicity, hydrophilicity, polarity, volume. Weight classes: -1.0 <= weight < -0.5; -0.5 <= weight < 0.0;
0.0 <= weight < 0.5; 0.5 <= weight <= 1.0. The cleavage-site […]; the mature protein starts at position
+1 (N-terminal end of the mature protein).

Description of chloroplast transit peptide cleavage-site patterns: The cleavage-site regions in
chloroplast transit peptides (Figure 2B) do not have two separate regions as in E.coli precursors. Besides a
cluster of extreme values at the positions -5 to -1 no important parts can be identified. […]
residues appear to be dominant at -3 and -5, separated by a strongly hydro[…]. At
position -2 a lack of hydrophobicity and polarity […], at -1 small residues seem […].

Description of mitochondrial prepeptide cleavage-site patterns: The positions -8, -4, -1, […]
show extreme weights in the mitochondrial precursor sequences (Figure 2C). Position -8 […]
mainly hydrophobic. The positions -4 and -1 clearly lack hydro[…]. […] (Arg) at position -2
(11) cannot be confirmed.

DISCUSSION 

[…] sequence position […] 100% […]. […] interpretations of our Hinton-diagram […] agree with […]
analyzed very intensively […] mutagenesis experiments replacing the native […] model sequences […]
alkaline phosphatase by a series of idealised […] hydrophobic core […] biocomputing results by our
PROSA-Design algorithm […]. […] Hinton-diagram, the hydrophobic core starts at position -6,
where a […] essential […] both training and test data are classified with 100% accuracy (Table 2), the
obtained features can be regarded as being generally valid.
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Features of chloroplast transit peptide cleavage-sites: Chloroplas, targeting sequences contain a 

great number of hydroxylated serine and threonine residues (32). The cleavage-site can be described by a
rather loosely defined consensus sequence: (Val/Ile)-X-(Ala/Cys)↓Ala (X means any amino acid). Arginines
are frequently found in the positions -6 to -10 (33). A comparison with the Hinton-diagram (Figure 2B)
confirms the cleavage-site rule (high hydrophobicity at position -3), but in addition position -2 can now be
precisely characterized by a non-hydrophobic and non-polar residue, which does not fit arginine (Table
1). The assumed equivalent role of valine and isoleucine at -3 is not surprising because these two amino
acids closely resemble each other in their physico-chemical properties (Table 1). On the other hand, an
equivalent role of alanine and cysteine at position -1 cannot be explained by a similarity in their properties
except for volume. It may be that cysteine and alanine have some other properties in common which […]
Table 1. At -1 the weight factors of the trained Perceptron did not reach extreme values in three properties,
except for volume (Figure 2B). A very small size will fit for alanine, but less well for cysteine, which is
larger than alanine, glycine and serine. According to the Hinton-diagram, a new chloroplast cleavage-site
feature can be identified which could be described by a "-3, -5 box" that allows very hydrophobic amino
acids only, while in position -4 very hydrophilic amino acids seem to be preferred. The chloroplast
cleavage-site rule based on the known experimental and statistical results - including those described here -
does not seem to be homologous to the rule found for eubacterial cleavage-sites. The Perceptron did not
find a generalizing feature, as the quality index for the test set is only 72% (Table 2). Since all analyses of
chloroplast targeting sequences carried out hitherto use only a single set of sequences - which is
analogous to our single training set - and no test set at all, it is not surprising that our results do not
contradict known facts. Further optimization of the neural network architecture is essential to obtain
generally valid features (16, 30). Anyway, it was possible to obtain several results consistent with known
experimental facts, and apart from that some more possible characteristics have been found. Site-directed
mutagenesis experiments should clarify these uncertainties.

Features of mitochondrial prepeptide cleavage-sites: Mitochondrial cleavage-sites for the
recognition by the major matrix protease (protease I in higher eukaryotes) have been […] with less
success by the Perceptron than the chloroplast transit peptide cleavage-sites (Table 2). The obtained
square error of 4.89 is rather high compared to the E.coli result. Thus, some care has to be taken with the
interpretation of the Hinton-diagram (Figure 2C). Position -8 is found to be occupied by strongly
hydrophobic residues, which is confirmed by the literature (34). Further, in +1 small and apolar amino acids
[…] not very hydrophobic or hydrophilic […]. Such a description
fits best to the amino acid glycine according to Table 1. The Hinton-diagram may be interpreted further in
[…] cleavage-site rule by claiming a "-4, -1 box" which is characterized by […]hydrophobic
residues. Several features of the different cleavage-site classes have been identified by our
Perceptron system. Especially the choice of a description of primary structures in terms of physico-chemical
property scales was a necessary prerequisite and led to interpretable results. It is clearly shown
that even a single-unit network is able to extract biochemically comprehensible rules from molecular
sequence data. Probably, the results for chloroplast and mitochondrial cleavage-sites will be improved if
more properties are considered and feature extraction is performed by more powerful multilayer networks.

[…] The project has […]
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